A General Learning Method for Automatic Title Extraction from HTML Pages

Internal identifier: 000831 (Main/Exploration); previous: 000830; next: 000832

Authors: Sahar Changuel [France]; Nicolas Labroche [France]; Bernadette Bouchon-Meunier [France]

Source:

Lecture Notes in Computer Science [0302-9743]; 2009.

RBID : ISTEX:D4D1E3040C032904E47DC6D9E7209FF37CE927F5

Abstract

This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works pursued similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that allows learning textual titles based on style information and image-format titles based on image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.

Url:

https://api.istex.fr/document/D4D1E3040C032904E47DC6D9E7209FF37CE927F5/fulltext/pdf

DOI: 10.1007/978-3-642-03070-3_53

Links toward previous steps (curation, corpus...)

to stream Istex, to step Corpus: 000396
to stream Istex, to step Curation: 000390
to stream Istex, to step Checkpoint: 000353
to stream Main, to step Merge: 000839
to stream Main, to step Curation: 000831

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct:series"><teiHeader><fileDesc><titleStmt><title xml:lang="en">A General Learning Method for Automatic Title Extraction from HTML Pages</title>
<author><name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
</author>
<author><name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
</author>
<author><name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:D4D1E3040C032904E47DC6D9E7209FF37CE927F5</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/978-3-642-03070-3_53</idno>
<idno type="url">https://api.istex.fr/document/D4D1E3040C032904E47DC6D9E7209FF37CE927F5/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000396</idno>
<idno type="wicri:Area/Istex/Curation">000390</idno>
<idno type="wicri:Area/Istex/Checkpoint">000353</idno>
<idno type="wicri:doubleKey">0302-9743:2009:Changuel S:a:general:learning</idno>
<idno type="wicri:Area/Main/Merge">000839</idno>
<idno type="wicri:Area/Main/Curation">000831</idno>
<idno type="wicri:Area/Main/Exploration">000831</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">A General Learning Method for Automatic Title Extraction from HTML Pages</title>
<author><name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris</wicri:regionArea>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris</wicri:regionArea>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>Laboratoire d’Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris</wicri:regionArea>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2009</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">D4D1E3040C032904E47DC6D9E7209FF37CE927F5</idno>
<idno type="DOI">10.1007/978-3-642-03070-3_53</idno>
<idno type="ChapterID">53</idno>
<idno type="ChapterID">Chap53</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help indexing Web resources that are poorly annotated. Other works proposed similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that allows learning textual titles based on style information and image format titles based on image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms, such as Decision Trees and Random Forest algorithms are applied achieving good results despite the heterogeneity of our corpus, we also show that combining both methods can induce better performance.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Île-de-France</li>
</region>
<settlement><li>Paris</li>
</settlement>
</list>
<tree><country name="France"><region name="Île-de-France"><name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
</region>
<name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
<name sortKey="Bouchon Meunier, Bernadette" sort="Bouchon Meunier, Bernadette" uniqKey="Bouchon Meunier B" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
<name sortKey="Changuel, Sahar" sort="Changuel, Sahar" uniqKey="Changuel S" first="Sahar" last="Changuel">Sahar Changuel</name>
<name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
<name sortKey="Labroche, Nicolas" sort="Labroche, Nicolas" uniqKey="Labroche N" first="Nicolas" last="Labroche">Nicolas Labroche</name>
</country>
</tree>
</affiliations>
</record>
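
As a rough illustration of how the record above might be read programmatically, the following Python sketch pulls the title, author names, and DOI out of a trimmed copy of it. The function names, the trimmed sample, and the namespace URI are assumptions made for this example only; the record uses the wicri: prefix without declaring it, so the sketch binds that prefix to a placeholder URI before parsing.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record above; only the elements read below are kept.
SAMPLE = """<record><TEI wicri:istexFullTextTei="biblStruct:series"><teiHeader><fileDesc><titleStmt><title xml:lang="en">A General Learning Method for Automatic Title Extraction from HTML Pages</title>
<author><name sortKey="Changuel, Sahar" first="Sahar" last="Changuel">Sahar Changuel</name>
</author>
<author><name sortKey="Labroche, Nicolas" first="Nicolas" last="Labroche">Nicolas Labroche</name>
</author>
<author><name sortKey="Bouchon Meunier, Bernadette" first="Bernadette" last="Bouchon-Meunier">Bernadette Bouchon-Meunier</name>
</author>
</titleStmt>
<publicationStmt><idno type="doi">10.1007/978-3-642-03070-3_53</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

# Placeholder namespace URI (an assumption, not the real Wicri namespace):
# a strict XML parser rejects the undeclared wicri: prefix otherwise.
WICRI_NS = "urn:example:wicri"


def parse_record(xml_text):
    """Parse a record string after binding the undeclared wicri: prefix."""
    patched = xml_text.replace(
        "<record>", '<record xmlns:wicri="%s">' % WICRI_NS, 1
    )
    return ET.fromstring(patched)


def summarize(record):
    """Return the title, author names, and DOI found in the TEI header."""
    title = record.findtext(".//titleStmt/title")
    authors = [name.text for name in record.findall(".//titleStmt/author/name")]
    doi = record.findtext(".//publicationStmt/idno[@type='doi']")
    return {"title": title, "authors": authors, "doi": doi}


summary = summarize(parse_record(SAMPLE))
print(summary)
```

The same two functions should work on the full record text, since they only rely on the titleStmt and publicationStmt elements that appear in it.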

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000831 | SxmlIndent | more

Or, assuming $EXPLOR_AREA is already set to the root of the exploration area:

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000831 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:D4D1E3040C032904E47DC6D9E7209FF37CE927F5
   |texte=   A General Learning Method for Automatic Title Extraction from HTML Pages
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
